Plots are used for data exploration and presentation. The visualization types differ in a few ways.
2. Audience. You are the audience for data exploration visualizations. Presentation visualizations are for other scientists who are not familiar with the work and appear in journal articles, talks, and posters.
3. Message. Data exploration is meant to develop insights or find underlying structure in the data. This is often the first step in an analysis (e.g., the GMPD exercise). Data exploration follows a wandering, iterative path. Presentation figures illustrate the end finding.
4. Explanation. Effective presentation figures need context: the data source, the pattern observed, how it relates to other data, and so on. You are familiar with the data and are highlighting the main points with the visualization. This context can be supplied through figure annotations or, in the case of articles, through the caption.
5. Complexity. Presentation visualizations are less complex than exploration figures. Lower complexity could mean showing only a subset of the data, using fewer variables, or presenting the data in a familiar format. Colors used in presentation figures should be distinct, color-blind friendly, and, in the case of articles, readable when printed in black and white.
Exploration visualizations are part of a discussion, while presentation visualizations should tell a story.
Visualization path
There are plenty of guides on how to create the “best” visualization. This diagram is a good place to start when choosing your data visualizations.
Visualization Groups by Dr. Andrew Abela
If you’re plotting data for publication, here are some things to keep in mind:
For more details see Ten guidelines for effective data visualization in scientific publications by Kelleher and Wagener, 2011. At some point during your research career, you’ll likely hear about Edward Tufte’s philosophy on maximizing function over aesthetics.
At the end of the program, you’ll make a scientific poster that includes data visualizations created in R. You can export figures in many different file types (TIFF, PNG, PDF), but what is the best method for poster figures?
Some recommendations:
Save figures as PDF (pdf()) or PNG (png()); journals typically have very specific figure requirements for submission. Save the figure with code (e.g., pdf()) rather than with the Export button above the plot. The general syntax for saving plots is:
pdf("File_Name.pdf", width = 7, height = 5) # the function name is often the file extension; set dimension and resolution arguments here
# code to make the figure goes here
dev.off()
If you have created a blank or unreadable file, most likely you forgot to end the figure export process by running dev.off().
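For example, a complete export might look like the sketch below; the file name, dimensions, and resolution are illustrative, but a high res value (e.g., 300 dpi) is usually what you want for poster figures.
png("poster_figure.png", width = 8, height = 6, units = "in", res = 300) # open the device
plot(1:10, 1:10) # any plotting code goes here
dev.off() # close the device to finish writing the file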
R vs ggplot
Plotting in base R allows the user to create highly customized plots. This customization takes time and requires many decisions. An alternative is the package ggplot2, developed by Hadley Wickham and based on the Grammar of Graphics by Leland Wilkinson.
The workshops have exclusively used ggplot for visualizations because it works well with the tidyverse and doesn’t require many design decisions up front. All of the figures we have made could have been made in base R instead. Given that either works well in most cases, R users tend to have a strong preference for one or the other (ggplot vs. base R puts #TeamEdward vs. #TeamJacob to shame). This section will cover some of the basics of base R.
We will walk through an example using base R and then recreate the figure using ggplot2. For even more side-by-side examples, see Nathan Yau’s blog post on Flowing Data.
A simple plot can take many more lines of code than you might expect from the visualization. When plotting in base R you’ll use a handful of parameter settings, either in par() or in the plotting-related functions listed below.
Let’s create a plot of the total population by county area for 5 Midwest states (example taken from Selva Prabhakaran’s tutorial). This data is part of the ggplot2 package. We start with the basic scatter plot function plot() and then customize from there.
library(ggplot2) # load the package with the data
data("midwest", package = "ggplot2") # load the data; midwest is now in the working environment
plot(y = log10(midwest$poptotal), x = midwest$area, # call the x and y values
     col = as.factor(midwest$state), # point colors based on state
     pch = 19, cex = .75, # point shape and size
     ylim = c(3, 7), xlim = c(0, .1), # set the axis limits
     las = 1, # rotate the axis labels
     xlab = "Area", ylab = expression('Log'[10]*'(Total population)'), # label the axes
     main = "Area vs population" # add a title
)
This is where the true power of base R customization shows: you can change the axis ticks and labels, add text anywhere, and even combine multiple figure types in a single visualization. The most common addition to any visualization will be the legend, since legends are not created automatically when plotting with base R; you have to add them manually. There are a few different ways to do this, but the function legend() works in most cases. To add the legend to the plot above, run legend() after the plot() call.
legend("topright", col=c(1:5), pch=19,legend=levels(as.factor(midwest$state)))
The visualization would then look like this:
A grid of plots in base R can be created using the parameter settings mfrow or mfcol (a minimal grid sketch follows the inset example below). Base R also gives you the option to make inset plots (subplots), like this example here where the box plot is inside the histogram.
# generate random data from a normal and a binomial distribution to plot
x <- rnorm(100, sd = 0.5)
y <- rbinom(100, 1, 0.5)
par(fig = c(0, 1, 0, 1)) # figure region as fractions of the device: left, right, bottom, top
hist(x) # plot main figure
par(fig = c(0.07, 0.35, 0.5, 1), new = TRUE) # set dimensions of the inset plot
boxplot(x ~ y) # plot inset
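For comparison, here is a minimal sketch of the grid approach mentioned above, reusing x and y from the inset example:
par(mfrow = c(1, 2)) # 1 row and 2 columns of plots, filled in order
hist(x) # left panel
boxplot(x ~ y) # right panel
par(mfrow = c(1, 1)) # reset to one plot per device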
The layout() function allows the user to create multipanel plots of different sizes, like this:
# One figure in row 1 and two figures in row 2
# row 1 is 1/3 of the total height
# column 2 is 1/4 of the total width
attach(mtcars)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE),
       widths = c(3, 1), heights = c(1, 2))
hist(wt)
hist(mpg)
hist(disp)
If you’re interested in other customizations in base R, check out Paul Murrell’s R Graphics book.
The exact same scatter plot of county area vs. population size can be made using ggplot, and here the legend is created automatically. Check out the source of the example, which also has a compiled list of 50 different visualizations along with the code here.
For more detailed examples, check out the R Graphics Cookbook by Winston Chang.
# load package and data
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme.
# midwest <- read.csv("http://goo.gl/G1K41K") # bkup data source
# Scatterplot
gg <- ggplot(midwest, # data; every argument after this is connected with a '+' instead of a ','
  aes(x=area, y=log10(poptotal))) + # set the x and y columns from the data
  geom_point(aes(col=state)) + # put a point at each (x,y) value, colored by the state column
xlim(c(0, 0.1)) + # set x axis limits
labs( # name the different parts of the plot
subtitle="Area Vs Population",
y="Population",
x="Area",
title="Scatterplot",
caption = "Source: midwest")
plot(gg) # plot the object
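To save a ggplot figure from code, ggplot2 also provides ggsave(); a minimal sketch (the file name and dimensions are illustrative, with sizes in inches by default):
ggsave("midwest_scatter.pdf", plot = gg, width = 6, height = 4) # saves the gg object to disk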
Add-ons to R allow for interactive and/or animated plots. With these plots you can rotate, resize, and select data points, which can make data exploration a bit easier. We won’t have time to cover these plots as part of the workshop, but here are a few examples.
The plotly package is an add-on to ggplot2 for quick interactive plots. The package is still relatively new and under active development. Legends and other features are sometimes poorly displayed, but the interactivity may be useful for data exploration during an in-person meeting.
Below is an example from the plotly website. You’ll notice the syntax is similar to ggplot2’s, but the functions have changed a bit.
library(plotly)
p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length,
             marker = list(size = 10, color = 'rgba(255, 182, 193, .9)',
                           line = list(color = 'rgba(152, 0, 0, .8)', width = 2))) %>%
  layout(title = 'Styled Scatter', yaxis = list(zeroline = FALSE), xaxis = list(zeroline = FALSE))
p # plot the interactive graphic
plot_ly(z = volcano, type = "surface") # simple example of a 3D surface plot
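plotly can also convert an existing ggplot object directly. A minimal sketch, reusing the gg scatter plot object created earlier:
library(plotly)
ggplotly(gg) # the same midwest scatter plot, now interactive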
The googleVis package also has some great plots with tool tips built in.
Shiny is an easy introductory tool for more novel data visualizations and can produce dashboards for data exploration (e.g., HealthMap); a minimal app sketch follows.
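Below is a minimal sketch of a Shiny app, assuming only the midwest data from earlier; all input/output names here are illustrative, not part of any required API.
library(shiny)
library(ggplot2)
data("midwest", package = "ggplot2")
ui <- fluidPage(
  sliderInput("n", "Number of counties to plot", min = 10, max = nrow(midwest), value = 100),
  plotOutput("scatter")
)
server <- function(input, output) {
  output$scatter <- renderPlot({
    # re-plot the scatter each time the slider moves
    plot(log10(poptotal) ~ area, data = midwest[seq_len(input$n), ],
         pch = 19, xlab = "Area", ylab = expression('Log'[10]*'(Total population)'))
  })
}
# shinyApp(ui, server) # uncomment to launch the app locally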
The gganimate package also allows for quick animations like this.
suppressMessages(library(gganimate))
birth<-read.csv("birth.csv", sep='', header=TRUE)
pal <- c("#313695","#4575b4","#74add1","#abd9e9","#e0f3f8","#ffffbf","#fee090","#fdae61","#f46d43","#d73027","#a50026")
vals <- seq(10,32, length = 11)
birth <- ggplot(birth, aes(x = Year, y = BirthRate, frame = Year, cumulative = TRUE)) +
geom_line(colour="black") +
geom_point(shape = 21, colour="black", aes(fill=BirthRate), size=5, stroke=1) +
scale_x_continuous(limits=c(1880,2015)) +
scale_y_continuous(limits=c(10,32)) +
theme_minimal() +
scale_fill_gradientn(colors = pal, values = vals, rescaler = function(x, ...) x, oob = identity, guide=FALSE) +
xlab("Year") +
ylab("Birth rate")
p <- gganimate(birth, "birth.gif", ani.width = 750, ani.height = 500, interval = 0.1)
Netherlands birth rate
Just like other data visualizations, mapping in R can be done a few different ways. Common packages include:
maps, rMaps, mapdata, ggmap, choroplethr, raster, sp, and rgdal. The last three (raster, sp, and rgdal) are also useful for analyzing spatial data.
The choroplethr package is useful for plotting U.S. county level data like this:
library(ggplot2)
library(choroplethr)
library(choroplethrMaps)
library(mapproj)
data(df_county_demographics)
df_county_demographics$value = df_county_demographics$percent_hispanic
county_choropleth(df_county_demographics,
state_zoom = "texas",
title = "Texas County Percent Hispanic 2012 Estimates",
num_colors = 9) + coord_map()
Newer packages like googleVis or leaflet also have built-in interactive features.
library(leaflet)
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R")
m # Print the map
Some additional resources:
When we’ve explored a new dataset in the workshops, one of the first steps has been calculating descriptive statistics (e.g., summary(), mean(), sd()). However, we have not touched on inferential statistics, the branch of statistics that deals with hypothesis tests and weighs the evidence for or against particular ideas. If you have taken courses in probability and statistics, you may recall concepts such as p-values, F statistics, z-scores, and confidence intervals. The workshop will cover t-tests, correlation, and linear regression, a subset of inferential statistics. Below is a conceptual review of these methods.
Of course, R is able to calculate these things and, indeed, a great many other quantities that are useful for scientific inference. If your research requires more extensive inferential statistics, Modern Applied Statistics with S by Venables and Ripley1 is a good resource for the technical background and implementation.
To start, we’ll consider the problem of testing for a difference of means. The problem is this: suppose we have two different sets of observations and we would like to know if these sets differ from each other, or if the numerical differences we see simply reflect sampling error. If you recall the GMPD from the second workshop, we wanted to know which group had the highest parasite prevalence. Suppose we wanted to ask whether parasite prevalence differs between carnivores and primates. (Note we are only asking if they are different, not why they are different. If we find that they are in fact different, there may still be many reasons: one group may have higher diversity, be easier to sample, etc.)
library(tidyverse)
gmp <- read_csv("data/GMPD_main.csv") %>% drop_na(Prevalence)
gmp %>%
filter(Group != "ungulates") %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence)) %>%
ggplot(aes(x = Group, y = MeanPrev)) +
geom_bar(stat = "identity")
Conventionally, we say that “no difference” is the null hypothesis and “there is a difference” is the alternative hypothesis. We seek to “reject” the null hypothesis, and the traditional statistical test for this is the t-test. In R, a t-test may be performed using the function t.test(). Thus, to test for a difference in mean prevalence between carnivores and primates, we issue the following command:
carnivors <- gmp %>% filter(Group == "carnivores")
primates <- gmp %>% filter(Group == "primates")
t.test(carnivors$Prevalence, primates$Prevalence)
##
## Welch Two Sample t-test
##
## data: carnivors$Prevalence and primates$Prevalence
## t = 3.9955, df = 12251, p-value = 6.494e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03063570 0.08964417
## sample estimates:
## mean of x mean of y
## 0.3881653 0.3280254
Note that t.test() requires the measured values from the two populations being compared (not the mean values); an equivalent call using the formula interface is sketched below.
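A minimal sketch of the same test written with t.test()’s formula interface, which accepts a grouping factor with exactly two levels (prev2 is an illustrative name):
prev2 <- gmp %>% filter(Group %in% c("carnivores", "primates")) # keep the two groups of interest
t.test(Prevalence ~ Group, data = prev2) # should match the two-vector call above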
As you see, the result consists of a list of different quantities. We will focus on just two. First, the 95% confidence interval on the difference of means consists of two numbers – a lower confidence limit and an upper confidence limit. This interval has the interpretation that, were this operation repeated many times with different data sets, in 95% of cases the reported interval would contain the true value of the quantity of interest, in this case the difference between the (true, not estimated) mean parasite prevalence of carnivores and primates. This is a technical definition, which you may be familiar with if you have studied probability and statistics. If you have not studied probability and statistics, you can roughly interpret the result this way: any of the values in the interval are broadly consistent with the observed data. Inspecting this confidence interval, we see that the true average parasite prevalence in carnivores may have exceeded that of primates by as much as 0.09, by as little as 0.03, and anything in between. Since zero is not in the interval from 0.03 to 0.09, we reject the null hypothesis that there is no difference in the average parasite prevalence of these two groups.
This conclusion is underscored by the p-value, in this case \(p=0.65 \times 10^{-4}\). The p-value can be interpreted this way: suppose the null hypothesis were true; given the sampling variation inherent in these data, what is the probability that a difference of means this great or greater would be observed? For our example, the probability is \(\approx 0.00006\), about a 0.006% chance. One says that a result is statistically significant only when the probability of observing such a result by chance falls below a threshold, which we name with the Greek letter \(\alpha\) and call the significance level. Conventionally, \(\alpha\) is chosen to be 5% or 0.05, so we would say “this test rejects the null hypothesis of no difference at the \(\alpha=0.05\) level of significance”.
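If you want to use these quantities programmatically rather than reading them off the printout, the object returned by t.test() is a list whose components can be extracted directly. A minimal sketch:
tt <- t.test(carnivors$Prevalence, primates$Prevalence) # store the test result
tt$conf.int # the 95% confidence interval reported above
tt$p.value # the p-value
tt$p.value < 0.05 # TRUE, so we reject the null at the alpha = 0.05 level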
The video covers the calculations needed to derive the confidence interval and p-value.
Two further techniques for statistical inference are linear regression and correlation. Correlation considers the level of association between two variables. As an example, the following three plots show three pairs of observations \(x_1\) and \(x_2\) that differ in their degree of correlation. (Traditionally, the correlation coefficient is named with the Greek letter \(\rho\).)
require(mvtnorm) # require is an alternative to `library()` that is best used inside of functions.
sigma1 <- matrix(c(1, 0, 0, 1), nrow=2, byrow=TRUE)
sigma2 <- matrix(c(1, 0.5, 0.5, 1), nrow=2, byrow=TRUE)
sigma3 <- matrix(c(1, 0.9, 0.9, 1), nrow=2, byrow=TRUE)
set.seed(10281979)
x1 <- rmvnorm(50, mean=c(0,0), sigma=sigma1)
x2 <- rmvnorm(50, mean=c(0,0), sigma=sigma2)
x3 <- rmvnorm(50, mean=c(0,0), sigma=sigma3)
par(mfrow=c(1,3))
plot(x1, main=paste('rho=',round((cor(x1[,1],x1[,2])),2)))
plot(x2, main=paste('rho=',round((cor(x2[,1],x2[,2])),2)))
plot(x3, main=paste('rho=',round((cor(x3[,1],x3[,2])),2)))
From one point of view, correlation is just a descriptive statistic and any two vectors of numbers will have some correlation coefficient. However, in the traditional hypothesis testing way, we can also express the null hypothesis of no correlation (\(H_0: \rho=0\)), which we then seek to reject. In R, this is done using the function cor.test(), which returns a 95% confidence interval and a p-value, just as t.test() did to test for a difference of means.
Note: The default measure of association returned by cor.test() is called Pearson’s correlation coefficient. However, there are actually several different conceptions of correlation, which differ in technical ways. Another of these is called Spearman’s rank-order coefficient and may be calculated by using the argument method='spearman'.
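As a quick sketch using the simulated data from above, we can test \(H_0: \rho=0\) for the moderately correlated pair x2 with both measures:
cor.test(x2[, 1], x2[, 2]) # Pearson's correlation (the default)
cor.test(x2[, 1], x2[, 2], method = "spearman") # Spearman's rank-order coefficient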
Linear regression is closely related to Pearson’s correlation and also considers the relationship between two variables. Specifically, linear regression assumes that the relationship between two quantities can be represented by a line (expressed with the equation \(y=mx+b\)) and that the failure of the data to fall exactly on a line is due to some variation or measurement error in the variable \(y\), such that the \(i^{th}\) observation \(y_i\) can be expressed as \(y_i=mx_i+b+\epsilon_i\), where \(\epsilon_i\) is the measurement error. Now the problem is to find the values of \(m\) and \(b\) (the slope and intercept) such that the total error is minimized in some sense, and possibly also to test hypotheses (for instance, that \(m=0\) or \(b=0\)). In ordinary (least squares) regression, we seek to minimize the sum of squared errors: \(\Sigma_i (y_i-(mx_i+b))^2\). Since this is a linear model, the best-fitting values of \(m\) and \(b\), as well as a lot of other information, may be obtained using the function lm().
To illustrate, we draw on a piece of epidemiological theory. By way of background, theoretical epidemiologists are often concerned with a key quantity called the basic reproduction ratio, designated \(R_0\). \(R_0\) is defined as the average number of secondary cases that will arise by contagious infection from an index patient in a wholly susceptible population. \(R_0=1\) is known as a critical point because infectious diseases with \(R_0>1\) almost always result in large epidemics whereas infectious diseases with \(R_0<1\) rapidly die out. (Do you see why this is, given the definition?) It can be shown (using arguments not reproduced here) that for a rapidly spreading epidemic
\[\begin{equation*} \log{Y_t}\;\approx\;\log{Y_0}+(R_0-1)\,(\gamma)\,t, \end{equation*}\]where \(Y_t\) is the number of individuals currently infected at time \(t\) and \(\gamma^{-1}\) is the infectious period. This implies that a semi-log plot of \(Y_t\) vs \(t\) should be approximately linear with a slope proportional to \(R_0-1\) and the recovery rate. If we plot the influenza data, we see that this is indeed the case.
flu <- read.csv("data/flu.csv") # load data that is two columns (day, flu)
plot(flu, type = 'b', log = 'y', xlab = 'Day', ylab = 'Number of individuals infected')
Ultimately, then, linear regression will allow us to estimate \(R_0\). However, to do this we need to note a few things. First, it is the logarithm of the number of cases on day \(t\) that takes the place of \(y\) in our expression \(y=mx+b\). In R the logarithm is calculated using the function log(). Next, \(\log{Y_0}\) takes the place of \(b\). If we assume that the epidemic was started by a single infectious individual, then \(b=\log(1)=0\), so we’re interested not in any best-fitting line, but only the best-fit line that also goes through the origin \((x=0, y=0)\). In regression analysis, this is called fitting a model without an intercept. Third, if we interpret \(t\) as the quantity that takes the place of \(x\) in \(y=mx+b\), then we see that it is the entire expression \((R_0-1)\,\gamma\) that forms the slope \(m\). Finally, the formula only applies to the epidemic takeoff, so we only want to fit a subset of our data. Inspecting the plot suggests that we might fit our line to data for all days up to and including day 5 of the outbreak. Now, we fit the line using lm() as follows:
model <- lm(formula = log(flu) ~ day - 1, data = subset(flu, day <= 5))
Before looking at the output, there are a number of things to notice about this command.
First, the results are stored in an object that I have called model.
Second, the first argument (called formula) is nothing like any argument we’ve seen before. It looks sort of like a mathematical expression. The first term says that the quantity we want to take the place of \(y\) in our formula \(y=mx+b\) is the logarithm of something called flu. (But where does flu come from? More on that in a moment.)
Next we see the character tilde (\(\sim\)). This may be read as “distributed as” or you may more simply think of this as taking the place of the equals sign in our formula \(y=mx+b\).
Finally, to the right of the tilde we see information about what is on the right-hand side of the equation. First, there is day, which takes the place of \(x\). (But where does day come from? Answer: the same place as flu.) Because day is a variable, R assumes you wish to find a slope parameter to multiply by day. We also see a \(-1\), which is a coded way of telling R that we wish to fit only equations that pass through the origin. If we left this off, R would seek the combination of \(m\) and \(b\) that minimized the sum of squared errors; as it is, R will fix the intercept at zero and return only a slope coefficient. Now, where does R look to find the variables flu and day? This is provided by the argument data. The possible values for data are data frames that contain variables called flu and day; if we supplied the name of a data frame that did not contain these variables, R would return an error. In our case, however, we’re not interested in using all the data in flu, but only those records for which day is less than or equal to 5, which we obtain using the subset() function as indicated. subset() is a base R alternative to dplyr’s filter(); a sketch of the same fit using filter() follows below.
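For reference, a minimal sketch of the same fit with filter() (model2 is an illustrative name; the coefficients should be identical):
library(dplyr)
model2 <- lm(log(flu) ~ day - 1, data = filter(flu, day <= 5)) # same model, tidyverse-style subsetting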
Now we wish to see what our fit model consists of. For starters, we can simply type the name of the object at the command line.
model
##
## Call:
## lm(formula = log(flu) ~ day - 1, data = subset(flu, day <= 5))
##
## Coefficients:
## day
## 1.083
This, we see, prints to the screen the formula that was used to fit the model as well as the fitted coefficients, in our case \(m=1.083\). But there is more information that we can obtain. Here we come to the other common use of summary(). If we ask for a summary of the model, we get a lot more information: the formula; a list of residuals (the \(\epsilon_i\) from above); more information about the coefficient \(m\), including the estimate, an estimate of its error (i.e., a value that, when added to and subtracted from the estimate, provides an interval – though not a 95% confidence interval), something called the “t value” (which we won’t worry about further), and finally a p-value for the hypothesis test that the coefficient is equal to zero. We also get some additional summary statistics, such as the \(R^2\) value (the fraction of variance explained by the model) and other statistical details.
summary(model)
##
## Call:
## lm(formula = log(flu) ~ day - 1, data = subset(flu, day <= 5))
##
## Residuals:
## 1 2 3 4 5
## 0.015150 -0.087483 0.081817 -0.003116 -0.014634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## day 1.083462 0.008202 132.1 1.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06083 on 4 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9997
## F-statistic: 1.745e+04 on 1 and 4 DF, p-value: 1.97e-08
Finally, then, we are in a position to calculate \(R_0\). Recalling from before, there is an equivalence between \(m\) in the equation \(y=mx+b\) and the parameter combination \((R_0-1)\,\gamma\), so we write \(m=(R_0-1)\,\gamma\). Given that the infectious period of influenza is about 2.5 days, we conclude that \(\gamma \approx 2.5^{-1} = 0.4\). Rearranging, we have \(\hat{R}_0 = m/\gamma + 1 = 1.083/0.4 + 1 \approx 3.7\) (writing \(R_0\) with a “hat” to signify that this is an estimate).
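The same arithmetic can be done in R by pulling the fitted slope out of the model with coef(); a minimal sketch (R0_hat is an illustrative name):
gamma <- 1 / 2.5 # recovery rate, from a 2.5-day infectious period
m <- coef(model)[["day"]] # fitted slope from the regression above
R0_hat <- m / gamma + 1
R0_hat # approximately 3.7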
[S is a statistical computing language](https://en.wikipedia.org/wiki/S_(programming_language)) and the predecessor of R. The code implementation examples in the book are in base R.↩